Speed up CI and test adjoint in parallel by cpjordan · Pull Request #438 · thetisproject/thetis

cpjordan · 2025-12-28T18:45:11Z

Depends on #437. Closes #426.

Testing adjoint in (MPI) parallel:
For the MPI parallel adjoint tests, I have used the channel-optimisation example and copied the mesh across, because I couldn’t figure out how to add a subdomain to a RectangleMesh. We could move this into a channel-optimisation directory if that’s cleaner, or alternatively switch to a simplified headland inversion example with regions defined as Constant controls. Either way, this requires a duplicate test. With the current example testing setup, we don’t have a way to select whether tests are run in serial or parallel. This could be updated so that, along with each adjoint example, we specify whether it is serial or parallel, which would avoid the duplication.

I've noticed that every now and again the Thetis MPI parallel tests hang and can take hours to complete; it might be worth also adding @pytest.mark.timeout(300) for all our parallel tests. I don't think this is because of Thetis itself, but rather due to MPI collectives or repeated pytest-xdist collection hanging.

Speeding up CI test suite:
It’s not possible to split matrix jobs on a single runner (as far as I can tell). I’ve implemented a matrix strategy that splits between runners, but we could revert to running everything on a single runner if preferred. Another option would be to merge the main and adjoint serial tests, although I prefer keeping them separate for clarity and cleaner outputs. If we stick with this then we need to update the tests that are required to pass for merging.

connorjward

also adding @pytest.mark.timeout(300) for all our parallel tests

I recommend setting --timeout 300 --timeout-method thread for this (example).

cpjordan · 2026-01-28T14:24:50Z

Thanks for the comments @connorjward - I'll take a look in detail again soon and probably have some follow up questions!

cpjordan · 2026-05-05T16:19:59Z

Hi @connorjward - "soon" was a lot later than I thought it would be. Would you mind reviewing this?

I recommend setting --timeout 300 --timeout-method thread for this (example).

I've done this.

I think the main query I have is whether you can split matrix jobs on a single runner (see original description)? I want to keep the standard and adjoint tests separate, but this leads to using two runners (fine with me) even though each runner has e.g. 8 cores, so you should theoretically be able to have the standard tests on 4 cores and the adjoint on the other 4. As far as I can tell, GitHub doesn't allow this?

connorjward · 2026-05-06T08:51:12Z

I think the main query I have is whether you can split matrix jobs on a single runner (see original description)? I want to keep the standard and adjoint tests separate, but this leads to using two runners (fine with me) even though each runner has e.g. 8 cores, so you should theoretically be able to have the standard tests on 4 cores and the adjoint on the other 4. As far as I can tell, GitHub doesn't allow this?

Yeah I'm pretty sure that splitting like that doesn't fit with what GitHub allows. I do wonder why they are separate jobs though. You are repeating the same setup and teardown for equivalent configurations. I would advocate for testing regular and adjoint in separate steps, as opposed to separate jobs.

cpjordan · 2026-05-06T10:10:09Z

I would advocate for testing regular and adjoint in separate steps, as opposed to separate jobs.

This is what we currently have, which is what causes testing to be so slow:

Slowest regular tests (667x):
- 444.81s call test/examples/test_examples.py::test_examples[/__w/thetis/thetis/thetis-repo/examples/discrete_turbines/tidal_array.py]
- 337.21s call test/swe2d/test_rossby_wave.py::test_convergence[DIRK22-bdm-dg]
- 211.42s call test/swe2d/test_rossby_wave.py::test_convergence[CrankNicolson-bdm-dg]
- 174.64s call test/swe2d/test_rossby_wave.py::test_convergence[SSPRK33-bdm-dg]
Slowest adjoint tests (7x):
- 1487.55s call test_adjoint/examples/test_examples.py::test_examples[inverse_problem.py2] (examples/tohoku_inversion/inverse_problem.py)
- 121.16s call test_adjoint/examples/test_examples.py::test_examples[channel-optimisation.py]
- 82.02s call test_adjoint/examples/test_examples.py::test_examples[inverse_problem.py1]

So because the slow adjoint test isn't tested alongside the regular tests, it dominates the CI time. This slow test (tohoku_inversion/inverse_problem.py) used to (wrongly) be part of the regular tests (https://github.com/thetisproject/thetis/actions/runs/24021170763/job/70050221687#step:9:1488) which had the CI running in ~20-25 minutes. tohoku_inversion/inverse_problem.py also used to be 44% faster but that's a separate problem.

Perhaps the solution is to split between regular tests, adjoint tests and then the examples separately. It would keep the diagnostic separation between adjoint and regular tests but speed things up by separating the examples from the non-example tests. Merging the examples together would then allow tohoku_inversion/inverse_problem.py to run with the other slow tests in the examples.
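
To make the timing argument concrete, here is a rough sketch that uses only the durations quoted above. It assumes 4 parallel test workers (the worker count is an assumption for illustration, not something stated in this thread) and uses a greedy longest-first assignment as a stand-in for how tests are spread across workers; the real scheduler will behave differently.

```python
import heapq

# Durations in seconds, copied from the timing lists above.
# Only the slowest tests are listed there, so totals are lower bounds.
regular = [444.81, 337.21, 211.42, 174.64]
adjoint = [1487.55, 121.16, 82.02]


def makespan(durations, n_workers):
    """Greedy longest-first estimate of wall time on n_workers workers."""
    loads = [0.0] * n_workers
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        # Give the next-longest test to the least-loaded worker.
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)


N_WORKERS = 4  # assumed worker count, for illustration only

# Two consecutive steps: the wall times add up.
separate_steps = makespan(regular, N_WORKERS) + makespan(adjoint, N_WORKERS)
# One pool: the other slow tests run while the Tohoku example is running.
one_pool = makespan(regular + adjoint, N_WORKERS)

print(f"separate steps: {separate_steps:.2f}s, one pool: {one_pool:.2f}s")
```

Under these assumptions the single pool is bounded by the Tohoku example alone, whereas consecutive steps pay for the slowest regular test on top of it.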

Thoughts @stephankramer?

connorjward · 2026-05-06T10:21:07Z

Ah sorry, I missed the motivation behind all this.

Perhaps the solution is to split between regular tests, adjoint tests and then the examples separately.

This does seem more natural to me.

Instead of having separate jobs maybe another approach is to run the non-example tests as an earlier step and only do the examples at the end. That way you still get fast feedback if things are breaking.

cpjordan · 2026-05-22T12:33:38Z

The latest commit tackles the main problem for CI speed directly, which is the Tohoku tsunami example. From my understanding:

The Okada source treats the rupture as a single rectangular fault plane, then subdivides it into a regular grid of subfault patches: num_subfaults_par x num_subfaults_perp which defaults to 13x10.
Each subfault patch contributes a deformation field over the whole mesh, and the inversion controls (depth/dip/slip/rake) are applied per subfault. So the number of subfaults drives:
- how many Okada contribution fields get computed and replayed in the adjoint, and
- how many control variables the optimiser sees.

I've reduced the grid to 2x2 for testing.
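
As a back-of-the-envelope check on what this reduction buys, the sketch below counts subfaults and control values for both grids. It assumes each of the four controls contributes one scalar per subfault, which is my reading of the description above rather than something verified against the Okada source code.

```python
# Controls applied per subfault, as listed in the description above.
CONTROLS = ("depth", "dip", "slip", "rake")


def problem_size(num_subfaults_par, num_subfaults_perp):
    """Return (number of subfaults, number of scalar control values)."""
    n_subfaults = num_subfaults_par * num_subfaults_perp
    # Assumption: one scalar per control per subfault.
    return n_subfaults, n_subfaults * len(CONTROLS)


default_size = problem_size(13, 10)  # default grid
reduced_size = problem_size(2, 2)    # grid used for testing

print("default:", default_size)
print("reduced:", reduced_size)
print("subfault ratio:", default_size[0] / reduced_size[0])
```

How run time scales with these counts is not measured here; the numbers only show how much smaller the test problem is.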

With that, the current testing (pre-PR) format is fine & much faster again. But we can also keep the examples separate (current PR approach) - I'm happy either way.

stephankramer · 2026-05-26T17:16:21Z

That all looks sensible now. The split in adjoint/non-adjoint tests used to be necessary because the tape started running immediately when you imported firedrake_adjoint. Now the only thing that's different is that teardown thing that you saw, which could also be run on "normal" tests (in the way you've changed it). Anyway, splitting in tests/examples/adjoint is also fine. I think we previously discussed and agreed that ideally we wouldn't duplicate the example script code in the test but adding functionality to run example scripts in parallel in CI is probably too much effort (and maintenance) - so let's leave that.

One final request though. Adding (large-ish) binary-files is not ideal. Could you change it to a sym-link? I think you can just "git rm", then locally make a (weak) symlink and "git add" it back - alternatively you could do some shutil.copy() in the test with a relative path? Either way, if you remove the mesh.msh file, this is a case where it's good to rewrite history (on the branch!): git rebase -i and squash the commit that removes the binary with the one that introduced it. Otherwise, everyone will still be downloading that binary file when they pull as it becomes part of history.
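
For the shutil.copy() option, a minimal sketch could look like the following. The helper name stage_mesh and the directory layout in the usage note are hypothetical; only the mesh.msh file name comes from this thread.

```python
import shutil
from pathlib import Path


def stage_mesh(example_dir, work_dir, mesh_name="mesh.msh"):
    """Copy the example's mesh into the test's working directory.

    Hypothetical helper: the directories are supplied by the caller.
    """
    src = Path(example_dir) / mesh_name
    if not src.is_file():
        # Fail early with a clear message instead of a confusing mesh error.
        raise FileNotFoundError(f"expected mesh at {src}")
    dest = Path(work_dir) / mesh_name
    shutil.copy(src, dest)
    return dest
```

In a test this would be called with a path built relative to the test file (for example from Path(__file__).parent) and pytest's tmp_path, so that no second copy of the binary needs to be committed.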

stephankramer

Wonderful, all good afaic!

cpjordan · 2026-05-28T15:57:43Z

I think we previously discussed and agreed that ideally we wouldn't duplicate the example script code in the test but adding functionality to run example scripts in parallel in CI is probably too much effort (and maintenance) - so let's leave that.

@stephankramer - I've added a mechanism to test the examples in parallel (commit 1). Since I did that for the adjoint example we wanted to test, I also then did the same for the normal examples (commit 2 - for future use, and for tidal_array.py which is currently considerably slower than the other tests).

You are losing a potential speedup of 4x here because there are 8 available cores on each runner. In Firedrake we use firedrake-run-split-tests to ensure utilisation (link).

Since I've now added another slow test, I've implemented thetis-run-split-tests (commit 3). It's basically the same as firedrake-run-split-tests, but has a fallback for if you haven't got GNU parallel available. This allows us to run our MPI parallel tests concurrently.
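
The idea behind splitting can be sketched as below. This is not how firedrake-run-split-tests or thetis-run-split-tests is implemented; it is a simplified illustration, and the test IDs are placeholders.

```python
def split_round_robin(test_ids, n_jobs):
    """Partition test IDs into n_jobs groups, round-robin."""
    return [test_ids[i::n_jobs] for i in range(n_jobs)]


N_CORES = 8         # cores per runner, as quoted above
RANKS_PER_TEST = 2  # each parallel test uses 2 MPI ranks

# Number of MPI jobs that fit on the runner at the same time.
n_jobs = N_CORES // RANKS_PER_TEST

tests = [f"test_{i}" for i in range(10)]  # placeholder test IDs
groups = split_round_robin(tests, n_jobs)

for job, group in enumerate(groups):
    print(job, group)
```

Each group would then be handed to its own 2-rank MPI run, so up to four runs are active at once instead of one.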

Everything works locally, except I need that teardown test guard or otherwise I get an AttributeError when no tape exists. Apologies for adding more review work, but it does remove the duplication, which is what you wanted. If this is not the right approach (or you want to leave it as is), I can just undo the commits and merge once the checks have completed.
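
The guard amounts to something like the sketch below. It assumes the AttributeError comes from calling clear_tape() on a tape that is None; FakeTape is a stand-in used only to exercise the logic, and in a real conftest the tape would come from pyadjoint's get_working_tape().

```python
def clear_tape_safely(tape):
    """Clear the adjoint tape if one exists; return True if it was cleared."""
    if tape is None:
        # No tape exists for this test, so tape.clear_tape() would raise
        # AttributeError. Nothing to clean up.
        return False
    tape.clear_tape()
    return True


class FakeTape:
    """Stand-in for a real tape object, for illustration only."""

    def __init__(self):
        self.blocks = ["block"]

    def clear_tape(self):
        self.blocks.clear()


tape = FakeTape()
print(clear_tape_safely(tape), tape.blocks)
print(clear_tape_safely(None))
```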

connorjward · 2026-05-28T16:00:58Z

Since I've now added another slow test, I've implemented thetis-run-split-tests (commit 3). It's basically the same as firedrake-run-split-tests, but has a fallback for if you haven't got GNU parallel available. This allows us to run our MPI parallel tests concurrently.

This seems like a useful contribution. Could this be upstreamed?

cpjordan · 2026-05-29T09:45:20Z

This seems like a useful contribution. Could this be upstreamed?

Happy to upstream if we think it's a goal. In Firedrake CI you already guarantee GNU Parallel is installed in the Docker image, and firedrake-run-split-tests assumes it. A fallback/helper is mainly useful if you want firedrake-run-split-tests to work outside the curated CI images (developer laptops/HPC login nodes/minimal images) where GNU Parallel isn't present or users can't install system packages. For Thetis I'm just going to use firedrake-run-split-tests since our CI image will already include GNU Parallel.

connorjward · 2026-05-29T10:12:33Z

This seems like a useful contribution. Could this be upstreamed?

Happy to upstream if we think it's a goal. In Firedrake CI you already guarantee GNU Parallel is installed in the Docker image, and firedrake-run-split-tests assumes it. A fallback/helper is mainly useful if you want firedrake-run-split-tests to work outside the curated CI images (developer laptops/HPC login nodes/minimal images) where GNU Parallel isn't present or users can't install system packages. For Thetis I'm just going to use firedrake-run-split-tests since our CI image will already include GNU Parallel.

I can imagine cases where users may not have it installed. Certainly isn't critical though.

stephankramer · 2026-06-01T13:44:02Z

+    from mpi4py import MPI
+    comm = MPI.COMM_WORLD
+    if comm.rank == 0:
+        workdir = tmp_path_factory.mktemp("thetis-example-tidal-array")


Should this be named tidal-array if in principle it can be extended to other examples?

Actually why are we special-casing parallel here at all, why not just use tmp_path?

cpjordan · 2026-06-02T20:47:00Z

See #459.

cpjordan · 2026-06-03T09:38:35Z

@connorjward (& @stephankramer) - I will add another PR for firedrake-run-split-tests, but I have demonstrated the reasoning and validation on this PR with the CI (it can be reproduced locally as well, but I wanted to check whether CI passes even if you had to use a shell-level timeout to end hanging processes).

Tests fail due to timeout but hang: https://github.com/thetisproject/thetis/actions/runs/26841840907/job/79150735012#step:10:612
Tests pass but hang: https://github.com/thetisproject/thetis/actions/runs/26630263069/job/79054723020#step:10:392
Workaround: https://github.com/thetisproject/thetis/actions/runs/26873715620/job/79255338447#step:10:415

The workaround tests work successfully (I just forgot to exclude the script from linting). Even prior to this PR, we saw this hanging behaviour sporadically when doing 2-core MPI tests consecutively rather than concurrently (e.g. https://github.com/thetisproject/thetis/actions/runs/24978421516 - logs are gone but this was an instance).
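
For reference, the behaviour of a shell-level timeout can be sketched in Python as follows. This is an illustration of the terminate-then-kill pattern, not the script used in the workaround; run_with_outer_timeout is a hypothetical name, and the sketch only signals the direct child process, not any processes that child may have spawned.

```python
import subprocess
import sys


def run_with_outer_timeout(cmd, timeout, kill_after):
    """Run cmd; on timeout ask it to terminate, then kill after a grace period."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.terminate()  # polite request first
        try:
            proc.wait(timeout=kill_after)
        except subprocess.TimeoutExpired:
            proc.kill()  # force-kill if the request is ignored
            proc.wait()
        return 124  # exit status GNU timeout uses when the command times out


# Demonstration with a child that would otherwise sleep for 30 seconds.
hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
rc = run_with_outer_timeout(hanging, timeout=0.5, kill_after=5.0)
print("exit status:", rc)
```

The point of the outer layer is that it still ends the run when the test framework's own timeout cannot, for example when a process is stuck outside Python code.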

I also suspect there are a lot of processes that need to be killed on the runners, but I don't have access to them.

cpjordan · 2026-06-11T13:15:49Z

Changes for Firedrake are now done in release - they just need to be merged into main to update the CI image/container. When done I'll switch to firedrake-run-split-tests and then rebase so we have three commits for the separate issues addressed:

- parallel MPI tests for an adjoint example
- speed up Tohoku inversion
- update CI workflow

cpjordan · 2026-06-15T16:19:53Z

@connorjward for the Thetis release branch CI we use the latest docker container and for main we use dev-main. The weekly tests are both run from the same main workflow file (which subsequently checks out the relevant branch). We therefore need firedrakeproject/firedrake#5147 and firedrakeproject/firedrake#5150 in main before we can finalise and merge this PR.

Do you want me to follow these instructions to merge these changes into main, or can we expect them to go in soon as part of a batch? I was at the Firedrake meeting last week and I think that you probably have a new release coming soon anyway?

connorjward · 2026-06-15T16:42:00Z

@connorjward for the Thetis release branch CI we use the latest docker container and for main we use dev-main. The weekly tests are both run from the same main workflow file (which subsequently checks out the relevant branch). We therefore need firedrakeproject/firedrake#5147 and firedrakeproject/firedrake#5150 in main before we can finalise and merge this PR.

Do you want me to follow these instructions to merge these changes into main, or can we expect them to go in soon as part of a batch? I was at the Firedrake meeting last week and I think that you probably have a new release coming soon anyway?

I am handling this in firedrakeproject/firedrake#5178. Should go through tonight or tomorrow morning.

connorjward · 2026-06-15T16:43:05Z

And for your release branch I would recommend using dev-release. Then you'll get the changes there too. I don't know about making a new release.

cpjordan · 2026-06-17T14:01:23Z

And for your release branch I would recommend using dev-release

@stephankramer thoughts? It would mean we are not testing exactly what a user is installing on release (if my understanding on the difference between dev-release and latest containers is correct), unless they have done an editable release install and are pulling the latest release regularly (unlikely).

connorjward · 2026-06-17T14:31:12Z

And for your release branch I would recommend using dev-release

@stephankramer thoughts? It would mean we are not testing exactly what a user is installing on release (if my understanding on the difference between dev-release and latest containers is correct), unless they have done an editable release install and are pulling the latest release regularly (unlikely).

I prefer it this way because then you're testing what a user is going to encounter soon. It gives you time to adapt to anything that breaks before it hits users. It also means that we will hear about breakages sooner.

cpjordan · 2026-06-17T16:37:32Z

I prefer it this way because then you're testing what a user is going to encounter soon.

That makes sense. The difference between latest and dev-release should always be fairly small and the likelihood of something working with one and not the other is very low - if latest isn't working then it's much more likely a Firedrake problem than a Thetis problem (and hence it will likely be updated fairly quickly!). This just happens not to be how things are currently because firedrake-run-split-tests is only updated in dev-release and would currently break our CI the way it is in latest.

I'm fine to switch but will leave it with Stephan to decide.

stephankramer · 2026-06-17T18:19:12Z

Yeah, I can see the argument for both and indeed the chances that it makes a difference are pretty small - but I think the advantages of "knowing things will break before the users notice" outweigh "knowing things are broken for users", so agree with @connorjward, let's switch to dev-release. It's good for us to spot if a "fix" in firedrake release (inadvertently, because it shouldn't) breaks things for us, so we can report and get it fixed before the next patch release. If we notice a bug outside our CI (say a user report) due to a firedrake bug which is then fixed in firedrake release, it may be good if we can then immediately add a regression test in Thetis. Finally, :latest does not get rebuilt regularly, so our weekly tests would be repeating the same exercise most of the time, and not spot any other "environment" changes.

cpjordan · 2026-06-18T10:00:26Z

Rebased into 3 clean commits, as reviewed. I have the backup branch locally if needed, but this should be good to go. The PR should be rebased and merged to keep these three commits separate.

stephankramer

Excellent, many thanks

stephankramer · 2026-06-18T11:03:41Z

@connorjward could you take one more look if you're happy?

connorjward

Some comments but ultimately this is CI: if it works in the way that you want then I would call it done and get on with more interesting work!

connorjward · 2026-06-18T11:22:20Z

-            -m parallel[2] thetis-repo/test
+          : # Split the parallel tests into multiple mpiexec jobs to utilise all cores
+          export FIREDRAKE_RUN_SPLIT_TESTS_TIMEOUT=660s
+          export FIREDRAKE_RUN_SPLIT_TESTS_KILL_AFTER=30s


You can set this at the top level of the file (example)

connorjward · 2026-06-18T11:26:26Z



-def test_examples(example_file, tmpdir, monkeypatch):
+def test_examples(example_file, tmp_path, monkeypatch, request):


We do this sort of thing by importing the file: https://github.com/firedrakeproject/firedrake/blob/main/tests/firedrake/demos/test_demos_run.py#L129

This seems totally fine though.
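
For comparison, running a script by importing it can be sketched with the standard library as below. This is a generic illustration, not a copy of the Firedrake test linked above; run_script_by_import is a hypothetical name.

```python
import importlib.util
from pathlib import Path


def run_script_by_import(path):
    """Execute a Python script by loading it as a module; return the module."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    # Runs the script's top-level code in the current process.
    spec.loader.exec_module(module)
    return module
```

Because the script runs in the same process, any exception it raises propagates directly into the test, and its top-level names can be inspected on the returned module afterwards.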

cpjordan marked this pull request as ready for review December 29, 2025 14:03

cpjordan requested a review from stephankramer January 27, 2026 14:15

connorjward reviewed Jan 27, 2026

Comment thread .github/workflows/core.yml Outdated

connorjward requested changes May 6, 2026

Comment thread .github/workflows/core.yml Outdated

Comment thread .github/workflows/core.yml Outdated

Comment thread .github/workflows/core.yml Outdated

stephankramer reviewed May 26, 2026

Comment thread test_adjoint/conftest.py

cpjordan mentioned this pull request May 28, 2026

Add developer notes to website #458

Closed

cpjordan force-pushed the speed-up-CI-main branch from a945a25 to 608f575 Compare May 28, 2026 13:28

stephankramer previously approved these changes May 28, 2026

cpjordan dismissed stephankramer’s stale review via a7d5500 May 28, 2026 15:51

cpjordan mentioned this pull request May 29, 2026

firedrake-run-split-tests: fall back when GNU parallel is missing firedrakeproject/firedrake#5147

Merged

stephankramer reviewed Jun 1, 2026

This was referenced Jun 3, 2026

Add outer timeout for mpi testing firedrakeproject/firedrake#5150

Merged

Date aware examples #325

Open

cpjordan added 3 commits June 18, 2026 10:25

Add parallel MPI test for adjoint example

fab2949

Speed up Tohoku inversion regression test

85a762b

Update CI workflow

9e84b9c

cpjordan force-pushed the speed-up-CI-main branch from 4740f06 to 9e84b9c Compare June 18, 2026 09:56

stephankramer approved these changes Jun 18, 2026

connorjward approved these changes Jun 18, 2026
